Requirements¶

  1. Remove all row that have "No" or "blank" in Column R.
  2. Remove duplicate IP addresses (only leave a single IP based on timestamp, should be the first entry)
  3. Remove row if ALL columns "T,U,V,W,X" are 6 and under. (I want to exclude any parents that have all of their children under the age of 6). 4.Remove any row if ALL columns "T,U,V,W,X" are 19 or over (I want to exclude any parents that have all of their children older than the age of 19)

Visualize

  1. Create a bar graph for column "BG" this should be a scale from 1-5 so it will be 5 bar graphs 2.Create a pie Graph for column " BD" .
In [1]:
# Importing Required Libraries
import pandas as pd
import matplotlib.pylab as plt
plt.style.use('ggplot')
In [2]:
# Loading Data in to Pandas Dataframe
df = pd.read_csv('Data_Survey.csv',skiprows=[1,2])
In [3]:
# Viewing First 5 Rows
df.head()
Out[3]:
StartDate EndDate Status IPAddress Progress Duration (in seconds) Finished RecordedDate ResponseId RecipientLastName ... Q14 Q15 Q16 Q17_1 Q18 Q19 Q20 Q21 Q21_7_TEXT Q22
0 4/4/2022 12:09 4/4/2022 12:09 IP Address 71.176.116.110 100 9 True 4/4/2022 12:09 R_3iOTGKdgC3iKADr NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4/4/2022 12:09 4/4/2022 12:17 IP Address 71.176.116.110 100 469 True 4/4/2022 12:17 R_3kgJOdcw2NW5yLy NaN ... NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN
2 4/4/2022 12:17 4/4/2022 12:24 IP Address 71.176.116.110 100 415 True 4/4/2022 12:24 R_3meha5WatdczRqh NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4/4/2022 12:24 4/4/2022 12:28 IP Address 71.176.116.110 100 253 True 4/4/2022 12:28 R_pGEDdFForjtw9xv NaN ... NaN NaN NaN 3.0 NaN NaN NaN NaN NaN NaN
4 4/5/2022 19:04 4/5/2022 19:05 IP Address 71.115.229.93 100 97 True 4/5/2022 19:06 R_eEeXiduEZEDzSx3 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 65 columns

In [4]:
# Dropping All Null values in Q2 column
df.dropna(subset=["Q2"], axis=0, inplace=True)
In [5]:
# Filtering all rows which are equal to Yes
df = df[df['Q2'] == 'Yes']
In [6]:
df = df.sort_values(by='EndDate') 
In [7]:
# Dropping all duplicate IPAddress keeping the first 
df.drop_duplicates(subset='IPAddress', keep="first")
Out[7]:
StartDate EndDate Status IPAddress Progress Duration (in seconds) Finished RecordedDate ResponseId RecipientLastName ... Q14 Q15 Q16 Q17_1 Q18 Q19 Q20 Q21 Q21_7_TEXT Q22
168 4/10/2022 0:28 4/10/2022 0:28 IP Address 194.182.8.81 9 6 False 4/17/2022 0:28 R_3kNi7S7qB8ixPSc NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
169 4/10/2022 10:28 4/10/2022 10:29 IP Address 173.171.17.193 18 31 False 4/17/2022 10:29 R_XGNLE1n8XUPkqbv NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
58 4/10/2022 14:09 4/10/2022 14:13 IP Address 108.20.152.110 100 225 True 4/10/2022 14:13 R_2bPs0M8EkNErq2l NaN ... Northeast Yes Yes 3.0 Graduate or professional degree (MA, MS, MBA, ... $150,000 or more 1923.0 Working (paid employee) NaN Management, professional, and related
171 4/10/2022 14:42 4/10/2022 14:43 IP Address 73.17.224.217 36 60 False 4/17/2022 14:43 R_2PaEJgsGQyPJHH9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
173 4/10/2022 16:09 4/10/2022 16:10 IP Address 71.174.126.166 18 63 False 4/17/2022 16:10 R_e3Z87iJLpTC8KIx NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
141 4/8/2022 9:08 4/8/2022 9:09 IP Address 163.116.135.114 18 45 False 4/15/2022 9:09 R_3Icb95xa1FwfQQr NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 4/8/2022 9:49 4/8/2022 9:52 IP Address 67.242.27.6 100 157 True 4/8/2022 9:52 R_Z8iDkYwlA4rg4SJ NaN ... Northeast Yes Yes 3.0 Graduate or professional degree (MA, MS, MBA, ... $75,000-$99,999 13214.0 Working (paid employee) NaN Service
52 4/9/2022 16:11 4/9/2022 16:18 IP Address 74.69.106.171 100 405 True 4/9/2022 16:18 R_2zIx0HaoRH3XVli NaN ... Southwest Yes Yes 4.0 Some high school or less Less than $25,000 13205.0 Not working (disabled) NaN NaN
50 4/9/2022 2:26 4/9/2022 2:33 IP Address 24.128.116.150 100 388 True 4/9/2022 2:33 R_25vcxKvPInRKpPe NaN ... West Yes Doesn't make a difference 3.0 Bachelor’s degree $100,000-$149,999 80501.0 Working (self-employed) NaN Sales and office
51 4/9/2022 8:33 4/9/2022 8:38 IP Address 107.77.223.18 100 302 True 4/9/2022 8:38 R_1kOdhmJiF5NV7l9 NaN ... Northeast No Yes NaN Graduate or professional degree (MA, MS, MBA, ... $150,000 or more 13078.0 Working (paid employee) NaN Management, professional, and related

250 rows × 65 columns

In [8]:
# Filtering all rows where Q4_1 is greater than 6 and less than 19
df = df[df['Q4_1'] > 6.0]
df = df[df['Q4_1'] < 19.0]
df = df.reset_index() 
df.shape
Out[8]:
(96, 66)
In [9]:
# Dropping all rows where age is greater than 6 and less than 19 in other columns
rows = df.index[[2,6,7,8,9,10,20,24,30,35,38,40,41,42,47,53,88,94]]
df.drop(rows, inplace=True)
In [10]:
df.head()
Out[10]:
index StartDate EndDate Status IPAddress Progress Duration (in seconds) Finished RecordedDate ResponseId ... Q14 Q15 Q16 Q17_1 Q18 Q19 Q20 Q21 Q21_7_TEXT Q22
0 58 4/10/2022 14:09 4/10/2022 14:13 IP Address 108.20.152.110 100 225 True 4/10/2022 14:13 R_2bPs0M8EkNErq2l ... Northeast Yes Yes 3.0 Graduate or professional degree (MA, MS, MBA, ... $150,000 or more 1923.0 Working (paid employee) NaN Management, professional, and related
1 178 4/10/2022 21:58 4/10/2022 21:59 IP Address 174.242.131.198 18 19 False 4/17/2022 21:59 R_3oTprtbTdAv5MRp ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 65 4/11/2022 15:25 4/11/2022 15:27 IP Address 128.230.194.156 100 141 True 4/11/2022 15:27 R_22qXuvuOs8iCQQK ... Northeast Yes Yes 4.0 Graduate or professional degree (MA, MS, MBA, ... $50,000-$74,999 13244.0 Working (paid employee) NaN Management, professional, and related
4 180 4/11/2022 6:38 4/11/2022 6:39 IP Address 174.192.0.57 18 26 False 4/18/2022 6:39 R_12areyXnWfxZDsR ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 62 4/11/2022 8:44 4/11/2022 8:48 IP Address 128.230.79.168 100 291 True 4/11/2022 8:48 R_2QJlWph9SoEoY0N ... Northeast Yes Doesn't make a difference 1.0 Graduate or professional degree (MA, MS, MBA, ... Prefer not to say 13090.0 Working (paid employee) NaN Management, professional, and related

5 rows × 66 columns

In [11]:
df.drop(df.columns[[0]], axis = 1, inplace = True)
In [12]:
# Saving file in csv format
df.to_csv('Data_Survey_Cleaned.csv',index=False)
In [13]:
# Creating Bar Graph of BG Column
df['Q17_1'].value_counts().plot(kind='bar',title="BG Column Bar Chart", backend = 'plotly')
In [14]:
# Creating Pie Chart of BD Column
df_pie =  df.Q14.value_counts().plot(kind='pie',title="BD Column Pie Chart")